I used the clean dataset produced in the “Colorado_Fulldataset” document in order to analyze only the general public’s response on Twitter, since that dataset already excluded tweets sent by official agencies and bots. The dataset contains 3858 tweets in total.
Tweets sent during the Flood and Immediate Aftermath phases of the disaster were then selected, which excluded 45% of the tweets. That is, we will work with 2132 of the original 3858 tweets.
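A minimal sketch of this filtering step with dplyr, on a toy stand-in for the cleaned dataset (the object name `tweets` and the `stage` column are assumptions, not taken from the original script):

```r
library(dplyr)

# Toy stand-in for the cleaned dataset; the real one has 3858 rows.
tweets <- tibble::tibble(
  text  = c("water rising fast", "rebuilding the trail",
            "creek flooding now", "donation drive today"),
  stage = c("Flood", "Recovery", "Immediate Aftermath", "Preparedness")
)

# Keep only the Flood and Immediate Aftermath phases.
flood_tweets <- filter(tweets, stage %in% c("Flood", "Immediate Aftermath"))
nrow(flood_tweets)  # 2 of the 4 toy rows survive
```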
Before any spatial analysis or plotting, the data was first projected to the North America Lambert Conformal Conic projection.
## Coordinate Reference System:
## User input: +proj=lcc +lat_1=20 +lat_2=60 +lat_0=40
## +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83
## +units=m +no_defs
## wkt:
## PROJCRS["unknown",
## BASEGEOGCRS["unknown",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6269]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]]],
## CONVERSION["unknown",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",40,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-96,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",20,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",60,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]]]
One of the goals of this study is to confirm whether spatially clustered tweets can serve as a proxy for reports from affected areas. So I will repeat the same analysis done for the whole dataset, but now considering only the tweets that belong to clusters after the spatial clustering process. A hierarchical implementation of DBSCAN (HDBSCAN) was used for the spatial clustering. HDBSCAN requires choosing the minimum number of points that can form a cluster. For any value of that parameter between 159 and 225, clusters in Colorado were identified, so we picked the value in that range that retains the maximum number of tweets: 160.
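The clustering step can be sketched with the `dbscan` package, which provides an HDBSCAN implementation; synthetic coordinates stand in for the projected tweet locations, and `minPts` is scaled down to match the toy data:

```r
library(dbscan)

set.seed(42)
# Two synthetic spatial clusters plus uniform noise, standing in for
# the projected tweet coordinates.
coords <- rbind(
  cbind(rnorm(60, 0, 1),   rnorm(60, 0, 1)),
  cbind(rnorm(60, 10, 1),  rnorm(60, 10, 1)),
  cbind(runif(20, -5, 15), runif(20, -5, 15))
)

# minPts = 160 in the thesis; scaled down here for the toy data.
cl <- hdbscan(coords, minPts = 15)
table(cl$cluster)  # cluster 0 collects the noise points outside any cluster
```

Tweets labeled 0 (noise) are the ones dropped by the spatial filter; only points assigned to a cluster are retained.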
After applying the spatial filter to the tweets sent during the Flood and Immediate Aftermath stages, only 30% of the total tweets were retained: 1172 of the 3858.
A quick view of the most common words in the whole dataset:
## # A tibble: 3,027 x 2
## word n
## <chr> <int>
## 1 boulder 734
## 2 boulderflood 355
## 3 cowx 145
## 4 flood 114
## 5 coflood 107
## 6 colorado 105
## 7 rain 89
## 8 flooding 77
## 9 creek 76
## 10 amp 60
## # … with 3,017 more rows
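Counts like the table above come from the standard tidytext pipeline; a sketch on two toy tweets (the `text` column name is an assumption):

```r
library(dplyr)
library(tidytext)

# Two toy tweets; the real pipeline runs over the full tweet texts.
tweets <- tibble::tibble(
  text = c("Boulder creek flooding", "rain in Boulder tonight")
)

word_counts <- tweets %>%
  unnest_tokens(word, text) %>%            # one lowercase token per row
  anti_join(stop_words, by = "word") %>%   # drop common English stop words
  count(word, sort = TRUE)

word_counts  # "boulder" tops the toy counts with n = 2
```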
Again, since “boulder” is the most common word and would have a large effect on our topic modelling, it was removed from the dataset. The term “boulderflood” was also excluded because it was very common and used neutrally across all four stages. After excluding these two terms, the new list of common words looks as follows:
## # A tibble: 3,025 x 2
## word n
## <chr> <int>
## 1 cowx 145
## 2 flood 114
## 3 coflood 107
## 4 colorado 105
## 5 rain 89
## 6 flooding 77
## 7 creek 76
## 8 amp 60
## 9 rt 55
## 10 water 52
## # … with 3,015 more rows
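The exclusion itself is a one-line filter; a sketch on a toy slice of the word-count table:

```r
library(dplyr)

# Toy slice of the word-count table shown above.
word_counts <- tibble::tibble(
  word = c("boulder", "boulderflood", "cowx", "flood"),
  n    = c(734, 355, 145, 114)
)

# Drop the two dominant, neutrally used terms before topic modelling.
word_counts <- filter(word_counts, !word %in% c("boulder", "boulderflood"))
word_counts$word  # "cowx" "flood"
```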
Again, after experimenting with different numbers of topics, I decided to train a topic model with 15 topics. From 16 topics onward, the topics started to look very similar (sharing the same bag of words). Here is a summary of the results after this process:
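A sketch of training such a model with the `stm` package, on a tiny two-vocabulary corpus so that it runs quickly (the thesis model uses the tweet texts and K = 15; all object names here are assumptions):

```r
library(stm)

# A tiny corpus with two clearly separated vocabularies.
docs <- c(
  rep("flood water creek river rising overflow", 4),
  rep("shelter donate volunteer recovery rebuild help", 4)
)

processed <- textProcessor(docs, verbose = FALSE)
prep <- prepDocuments(processed$documents, processed$vocab, verbose = FALSE)

# K = 2 here so the toy corpus can support it; in the thesis, K was
# chosen by comparing exclusivity and semantic coherence across models.
model <- stm(prep$documents, prep$vocab, K = 2,
             init.type = "Random", seed = 1,
             max.em.its = 20, verbose = FALSE)
labelTopics(model, n = 3)
```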
Mapping topics to see spatial distribution:
## Reading layer `Flood2013Extents' from data source `/home/marcela/Coding/MarcesThesis/Chapter1/Colorado/GroundTruthData/Flood2013Extents/Flood2013Extents.shp' using driver `ESRI Shapefile'
## Simple feature collection with 16 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 3055883 ymin: 1217916 xmax: 3090147 ymax: 1274267
## projected CRS: NAD83(HARN) / Colorado North (ftUS)
## Reading layer `Flood_2013_Inundated_Areas' from data source `/home/marcela/Coding/MarcesThesis/Chapter1/Colorado/GroundTruthData/Flood_2013_Inundated_Areas/Flood_2013_Inundated_Areas.shp' using driver `ESRI Shapefile'
## Simple feature collection with 165 features and 6 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 3001629 ymin: 1211193 xmax: 3124736 ymax: 1338535
## projected CRS: NAD83(HARN) / Colorado North (ftUS)
## Reading layer `CityofBoulder' from data source `/home/marcela/Coding/MarcesThesis/Chapter1/Colorado/GroundTruthData/Partial_Preliminary_2013_Colorado_Flooding_Extents/CityofBoulder.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -11721990 ymin: 4855973 xmax: -11708360 ymax: 4878435
## projected CRS: WGS 84 / Pseudo-Mercator
## Reading layer `Evans_SouthPlatte_Evans' from data source `/home/marcela/Coding/MarcesThesis/Chapter1/Colorado/GroundTruthData/Partial_Preliminary_2013_Colorado_Flooding_Extents/Evans_SouthPlatte_Evans.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -11659530 ymin: 4916371 xmax: -11641440 ymax: 4928671
## projected CRS: WGS 84 / Pseudo-Mercator
## Reading layer `DamagePolsAllIncluded' from data source `/home/marcela/Coding/MarcesThesis/Chapter1/Colorado/GroundTruthData/StructuresPols/DamagePolsAllIncluded.geojson' using driver `GeoJSON'
## Simple feature collection with 50 features and 8 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -105.5778 ymin: 39.63068 xmax: -102.728 ymax: 40.86519
## geographic CRS: WGS 84
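Note that the layers above arrive in three different CRSs (NAD83(HARN) / Colorado North, Pseudo-Mercator, and geographic WGS 84), so each must be reprojected to a common CRS before overlay. A sketch of the mapping and overlay step with `sf` and `ggplot2`, using a toy polygon in place of a flood-extent layer (the real ones come from `st_read()` on the paths echoed above):

```r
library(sf)
library(ggplot2)

# Toy polygon standing in for one of the flood-extent layers.
extent <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
))), crs = 4326)

# Two toy tweet locations: one inside and one outside the extent.
pts <- st_sfc(st_point(c(0.5, 0.5)), st_point(c(2, 2)), crs = 4326)

# Overlay tweets on the flood extent; with the real layers, call
# st_transform() on each one first so all share a common CRS.
p <- ggplot() +
  geom_sf(data = extent, fill = "steelblue", alpha = 0.3) +
  geom_sf(data = pts, size = 1)

# Flag which tweets fall inside the flood extent.
inside <- lengths(st_intersects(pts, extent)) > 0
inside  # TRUE FALSE
```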